AITopics | document pair

Collaborating Authors

document pair

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A NeurIPS Datasets and Benchmark Checklist

Neural Information Processing SystemsFeb-18-2026, 02:00:39 GMT

If it is a counter (e.g.

large language model, machine learning, natural language, (23 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts (0.04)
North America > United States > Florida > Martin County > Stuart (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology (1.00)
Banking & Finance > Real Estate (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
(2 more...)

Add feedback

Structured Document Translation via Format Reinforcement Learning

Song, Haiyue, Eschbach-Dymanus, Johannes, Kaing, Hour, Honda, Sumire, Tanaka, Hideki, Buschbeck, Bianka, Utiyama, Masao

arXiv.org Artificial IntelligenceDec-5-2025

Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose \textbf{Format Reinforcement Learning (FormatRL)}, which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.

large language model, machine learning, translation, (19 more...)

arXiv.org Artificial Intelligence

2512.051

Country:

Europe (1.00)
North America > United States (0.67)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

A NeurIPS Datasets and Benchmark Checklist

Neural Information Processing SystemsOct-9-2025, 12:25:42 GMT

If it is a counter (e.g.

large language model, machine learning, natural language, (24 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts (0.04)
North America > United States > Florida > Martin County > Stuart (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology (1.00)
Banking & Finance > Real Estate (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
(2 more...)

Add feedback

Overview of the Plagiarism Detection Task at PAN 2025

Greiner-Petter, André, Fröbe, Maik, Wahle, Jan Philip, Ruas, Terry, Gipp, Bela, Aizawa, Akiko, Potthast, Martin

arXiv.org Artificial IntelligenceOct-9-2025

The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning them with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack in generalizability.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.06805

Country:

Asia (1.00)
North America > United States (0.68)
Europe > Germany > Lower Saxony (0.28)

Genre: Research Report (0.50)

Industry: Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.85)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

O'Brien, Dayyán, Malik, Bhavitvya, de Gibert, Ona, Chen, Pinzhen, Haddow, Barry, Tiedemann, Jörg

arXiv.org Artificial IntelligenceOct-1-2025

Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2500 additional pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.

artificial intelligence, natural language, translation, (14 more...)

arXiv.org Artificial Intelligence

2508.13079

Country: Europe (0.93)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

Meng, Chutong, Koehn, Philipp

arXiv.org Artificial IntelligenceSep-24-2025

We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.1836

Country:

Europe (1.00)
Asia > Middle East > UAE (0.46)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.47)

Industry: Materials > Metals & Mining (0.59)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)

Add feedback

JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus

Nagata, Masaaki, Chousa, Katsuki, Yasuda, Norihito

arXiv.org Artificial IntelligenceAug-25-2025

We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publication of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from the DOCDB, that is a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 bleu points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.

application, artificial intelligence, natural language, (14 more...)

arXiv.org Artificial Intelligence

2508.16303

Country:

Asia > Japan > Honshū (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Industry:

Law > Intellectual Property & Technology Law (1.00)
Government > Regional Government > North America Government > United States Government (0.56)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Subtopic-aware View Sampling and Temporal Aggregation for Long-form Document Matching

Zhou, Youchao, Huang, Heyan, Wu, Zhijing, Liu, Yuhang, Wang, Xinglin

arXiv.org Artificial IntelligenceDec-23-2024

Long-form document matching aims to judge the relevance between two documents and has been applied to various scenarios. Most existing works utilize hierarchical or long context models to process documents, which achieve coarse understanding but may ignore details. Some researchers construct a document view with similar sentences about aligned document subtopics to focus on detailed matching signals. However, a long document generally contains multiple subtopics. The matching signals are heterogeneous from multiple topics. Considering only the homologous aligned subtopics may not be representative enough and may cause biased modeling. In this paper, we introduce a new framework to model representative matching signals. First, we propose to capture various matching signals through subtopics of document pairs. Next, We construct multiple document views based on subtopics to cover heterogeneous and valuable details. However, existing spatial aggregation methods like attention, which integrate all these views simultaneously, are hard to integrate heterogeneous information. Instead, we propose temporal aggregation, which effectively integrates different views gradually as the training progresses. Experimental results show that our learning framework is effective on several document-matching tasks, including news duplication and legal case retrieval.

data mining, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2412.07573

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(9 more...)

Genre: Research Report > New Finding (0.48)

Industry: Law (0.89)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(3 more...)

Add feedback

Pralekha: An Indic Document Alignment Evaluation Benchmark

Suryanarayanan, Sanjay, Song, Haiyue, Khan, Mohammed Safi Ur Rahman, Kunchukuttan, Anoop, Khapra, Mitesh M., Dabre, Raj

arXiv.org Artificial IntelligenceNov-28-2024

Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.19096

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Separating Style from Substance: Enhancing Cross-Genre Authorship Attribution through Data Selection and Presentation

Fincke, Steven, Boschee, Elizabeth

arXiv.org Artificial IntelligenceAug-9-2024

The task of deciding whether two documents are written by the same author is challenging for both machines and humans. This task is even more challenging when the two documents are written about different topics (e.g. baseball vs. politics) or in different genres (e.g. a blog post vs. an academic article). For machines, the problem is complicated by the relative lack of real-world training examples that cross the topic boundary and the vanishing scarcity of cross-genre data. We propose targeted methods for training data selection and a novel learning curriculum that are designed to discourage a model's reliance on topic information for authorship attribution and correspondingly force it to incorporate information more robustly indicative of style no matter the topic. These refinements yield a 62.7% relative improvement in average cross-genre authorship attribution, as well as 16.6% in the per-genre condition.

authorship attribution, batch, cross 0, (15 more...)

arXiv.org Artificial Intelligence

2408.05192

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry:

Education (0.34)
Media > News (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)

Add feedback